Data Cleaning

Clean the artist dataset:

  1. Drop the columns that are not useful for analysis, such as the various URL columns. Each artist also has 8 genre columns; to keep this dataset consistent with the others, keep only the priority genre. However, the priority genre is stored in either the genres1 column or the genres column, depending on how many genres the artist has, so at this step I only drop the genres2 to genres8 columns.

    Column names before dropping:

    Column names after dropping:
  2. Combine the genres1 column and the genres column into a single final genre column for each artist.

    Before cleaning (the green and red colored words are the genres; this shows only part of them, since each artist has 8 genre columns):

    After cleaning:
  3. As can be observed from the previous picture, many artists have no genre information. Fill these missing values in the genre field with the word ‘undefined’.

    After filling the missing values:

  4. Map the specific genre to a broader genre to reduce the number of genres. This step was done similarly to the last dataset.

    Before mapping (there are 75 genres in total; only a part of them is shown here):

    After mapping:
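The four steps above can be sketched in pandas roughly as follows. This is a minimal illustration: the column names other than genres/genres1/genres2 and the broad-genre mapping are assumptions, not the dataset's actual values.

```python
import pandas as pd

# Toy artist frame; "name" and "url" are placeholder column names,
# and "genres2" stands in for the real genres2..genres8 columns.
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "url": ["u1", "u2", "u3"],
    "genres": ["rock", None, None],
    "genres1": [None, "pop", None],
    "genres2": [None, "dance pop", None],
})

# 1. Drop URL-style columns and the redundant genres2..genres8 columns.
df = df.drop(columns=["url", "genres2"])

# 2. The priority genre lives in either genres1 or genres; combine them.
df["genre"] = df["genres1"].fillna(df["genres"])
df = df.drop(columns=["genres", "genres1"])

# 3. Fill artists with no genre information with 'undefined'.
df["genre"] = df["genre"].fillna("undefined")

# 4. Map specific genres to broader ones (this mapping is illustrative).
broad_genre = {"dance pop": "pop", "indie rock": "rock"}
df["genre"] = df["genre"].replace(broad_genre)
```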

Here is a sample of the cleaned dataset:

Link to the cleaned dataset: spotify_artist.csv

Clean lyrics dataset:

  1. Merge all the lyrics datasets (genius_lyrics_2022.csv, genius_lyrics_global.csv, genius_lyrics.csv) into one data frame.

  2. Checking shows there are duplicate rows, since the same song can appear in multiple trending lists, so drop all the duplicates. After dropping, use df.duplicated().sum() to confirm there are none left; the result should be 0.

  3. Drop the rows with missing lyrics, since they serve no use in the analysis.

    Before dropping:

    After dropping, use df.isnull().sum() to check whether any missing values remain; the result should be 0.

  4. Remove the indications and introductions before each song’s lyrics using regular expressions.

    Before removing:


    After removing (notice the [Chorus] and [Verse] tags are gone, and lyric metadata such as the contributor count is removed as well):

  5. Remove all unprintable characters, punctuation, and non-English characters, since they will not be used in the analysis.

    Before removing:


    After removing (notice that lyrics consisting only of non-English characters are now empty, and all the punctuation is gone):

  6. Drop the rows that are now empty using df.dropna().

  7. Create a bag of words from the lyrics column using CountVectorizer with the stop_words='english' parameter to remove stop words, then drop the original lyrics column.

Here is a sample of the cleaned dataset:

Link to the cleaned dataset: genius_lyrics_cleaned

Clean TikTok dataset

  1. Check for missing values using df.isnull().sum(); there are no missing values in this dataset, as shown in the picture below:

  2. Check for duplicate rows using duplicates = df_tik[df_tik.duplicated()]; it returns an empty dataframe, showing there are no duplicates in this dataset.

  3. Drop the columns that are not useful: album name and artist popularity.

    Before dropping:


    After dropping, notice the dropped columns are gone:

    Rename the columns for better data retrieval. After renaming, the column names look like this:

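These checks and drops can be sketched in a few lines of pandas. The column names here are placeholders, not the TikTok dataset's actual names:

```python
import pandas as pd

# Illustrative frame; the real TikTok dataset's columns will differ.
df_tik = pd.DataFrame({
    "track_name": ["T1", "T2"],
    "album_name": ["A1", "A2"],
    "artist_popularity": [80, 75],
    "track_pop": [90, 60],
})

# 1. Check for missing values (all zeros here).
assert df_tik.isnull().sum().sum() == 0

# 2. Check for duplicate rows (an empty frame means none).
duplicates = df_tik[df_tik.duplicated()]
assert duplicates.empty

# 3. Drop the not-useful columns, then rename for easier retrieval.
df_tik = df_tik.drop(columns=["album_name", "artist_popularity"])
df_tik = df_tik.rename(columns={"track_pop": "track_popularity"})
```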

Here is a sample of the cleaned dataset:

Link to the cleaned dataset: tiktok_cleaned.csv

Clean last.fm listening events dataset

  1. There are too many listening events in the dataset, 30,357,786 rows in total, which could cause trouble in the analysis. So here we randomly sample 10,000 listening events from the dataset. (This is just an initial decision; it could change as the analysis goes on.)

  2. The columns in the dataset are squashed into one column, so I need to split it on the separator character.

    Before splitting:


    After splitting, notice the single column is now separated into four columns:

  3. Check whether there are any missing values in the sampled dataset using colSums(is.na(df)); in this case there are none.
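The report does this step in R; for consistency with the other sections, here is an equivalent pandas sketch. The separator character, the column names, and the toy data are all assumptions:

```python
import pandas as pd

# Toy stand-in for the 30M-row listening-events file; the tab separator
# and the four field names are assumptions for illustration.
raw = pd.DataFrame({"all": [
    "u1\t2019\tartistA\ttrackA",
    "u2\t2020\tartistB\ttrackB",
]})

# 1. Randomly sample events (10,000 in the report; 1 here for toy data).
sample = raw.sample(n=1, random_state=42)

# 2. Split the single squashed column into four columns.
split = sample["all"].str.split("\t", expand=True)
split.columns = ["user_id", "year", "artist", "track"]

# 3. Check for missing values, mirroring R's colSums(is.na(df)).
assert split.isnull().sum().sum() == 0
```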

Here is a sample of the cleaned dataset:

Link to the cleaned dataset: listening_events_sample.csv

Clean last.fm track datasets

  1. The data itself is encoded in TSV format, but the separator is not consistent: sometimes it is a tab and sometimes it is spaces, so it cannot be read directly with pandas. First, I read the file as plain text and change all the separators to commas, then save the result as a .csv file.

    Before changing the separator:

    After changing the separator and saving as a .csv file:

  2. Many values in the track name column are clearly unreadable, consisting of ‘!’, or simply dots and white spaces, or similar. Remove all the values that consist only of these characters with a regular expression.

    Before removing:

    After removing (notice the meaningless values are turned to NA):

  3. Drop all the rows with missing or unprintable values and write the cleaned dataset to a new .csv file.
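The separator fix and garbage-value removal can be sketched as below. The sample text, the exact regex patterns, and the column names are assumptions standing in for the real file:

```python
import io
import re

import pandas as pd

# Toy TSV-ish content with inconsistent separators (a tab in one row,
# runs of spaces in others), mimicking the problem described above.
raw_text = "id\ttitle\n1   !!!\n2\tRealSong\n3   ...\n"

# 1. Normalize every separator (tab or run of spaces) to a comma,
#    then read the result as CSV.
csv_text = re.sub(r"[\t ]+", ",", raw_text)
df = pd.read_csv(io.StringIO(csv_text))

# 2. Track names made only of '!', dots, or whitespace are meaningless;
#    turn them into NA with a regular expression.
df["title"] = df["title"].replace(r"^[!.\s]+$", pd.NA, regex=True)

# 3. Drop the rows with missing values.
df = df.dropna()
```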

A sample of the cleaned dataset:

Link to the cleaned dataset: tracks_cleaned.csv

Clean last.fm user dataset

  1. Plenty of users’ country information is missing; I replace all the empty values with the word ‘unknown’.

    Data before cleaning:

  2. It is worth noticing that there are a lot of -1 ages in this dataset, making up 0.3% of the data. With so many missing values, replacing them with either the mean or the median would noticeably change the variance of the data (the variance drops by 23 when using the median, and by more when using the mean). So I decided that replacing these missing values should wait until after more analysis in the next stage.

  3. Also, from the distribution we can see that very few users have an age over 80, so remove them as well; this is about 0.03% of the data.

    Here is the distribution of age in the dataset:

  4. Use binning to group the age values into 5-year bins for better analysis.

    Before binning, the frequency table of ages looks like this:

    After binning:
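A pandas sketch of the user cleaning, with made-up column names and toy values. Note that the -1 ages are deliberately left alone here, matching the decision to revisit them later; pd.cut simply leaves them unbinned:

```python
import pandas as pd

# Hypothetical user frame; the column names are assumptions.
users = pd.DataFrame({
    "country": ["US", None, "JP", None],
    "age": [23, 17, 95, -1],
})

# 1. Replace missing country values with 'unknown'.
users["country"] = users["country"].fillna("unknown")

# 3. Remove users with age over 80. The -1 "missing" ages are kept
#    for now, per the decision to handle them in the next stage.
users = users[users["age"] <= 80].copy()

# 4. Bin ages into 5-year ranges; -1 falls outside the bins and
#    stays NaN, i.e. unbinned.
users["age_bin"] = pd.cut(users["age"], bins=range(0, 85, 5))
```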

Here is a sample of the cleaned dataset:

Link to the cleaned dataset: users_cleaned.csv